Abstract

In this paper we prove the optimality of an aggregation procedure. We prove lower bounds
for aggregation of model selection type of $M$ density estimators for the Kullback-Leibler
divergence (KL), the Hellinger distance and the $L_1$-distance. The lower bound, with
respect to the KL distance, can be achieved by the on-line type estimate suggested, among
others, by Yang (2000). Combining these results, we state that $\log M/n$ is an optimal rate
of aggregation in the sense of Tsybakov (2003), where $n$ is the sample size.
Keywords: Aggregation, optimal rates, Kullback-Leibler divergence.

1. Introduction



Let $(\mathcal{X}, \mathcal{A})$ be a measurable space and $\nu$ be a $\sigma$-finite measure on
$(\mathcal{X}, \mathcal{A})$. Let $D_n = (X_1, \dots, X_n)$ be a sample of $n$ i.i.d. observations
drawn from an unknown probability of density $f$ on $\mathcal{X}$ with respect to $\nu$. Consider
the estimation of $f$ from $D_n$.

Suppose that we have $M \ge 2$ different estimators $\hat f_1, \dots, \hat f_M$ of $f$. Catoni (1997),
Yang (2000), Nemirovski (2000), Juditsky and Nemirovski (2000), Tsybakov (2003), Catoni (2004)
and Rigollet and Tsybakov (2004) have studied the problem of model selection type
aggregation. It consists in the construction of a new estimator $\tilde f_n$ (called aggregate) which is
approximately at least as good as the best among $\hat f_1, \dots, \hat f_M$. In most of these papers, this
problem is solved by using a kind of cross-validation procedure. Namely, the aggregation
is based on splitting the sample into two independent subsamples $D_m^1$ and $D_l^2$ of sizes $m$ and
$l$ respectively, where $m \gg l$ and $m + l = n$. The size of the first subsample has to be
greater than that of the second because it is used for the true estimation, that is for
the construction of the $M$ estimators $\hat f_1, \dots, \hat f_M$. The second subsample is used for the
adaptation step of the procedure, that is for the construction of an aggregate $\tilde f_n$, which has
to mimic, in a certain sense, the behavior of the best among the estimators $\hat f_j$. Thus, $\tilde f_n$ is
measurable w.r.t. the whole sample $D_n$, unlike the first estimators $\hat f_1, \dots, \hat f_M$.

One can suggest different aggregation procedures and the question is how to look for an
optimal one. A way to define optimality in aggregation in a minimax sense for a regression
problem is suggested in Tsybakov (2003). Based on the same principle we can define
optimality for density aggregation. In this paper we will not consider the sample splitting
and concentrate only on the adaptation step, i.e. on the construction of aggregates (following
Nemirovski (2000), Juditsky and Nemirovski (2000), Tsybakov (2003)). Thus, the first
subsample is fixed and instead of estimators $\hat f_1, \dots, \hat f_M$, we have fixed functions $f_1, \dots, f_M$.
Rather than working with a part of the initial sample we will use, for notational simplicity,
the whole sample $D_n$ of size $n$ instead of a subsample $D_l^2$.



The aim of this paper is to prove the optimality, in the sense of Tsybakov (2003), of
the aggregation method proposed by Yang, for the estimation of a density on $(\mathbb{R}^d, \lambda)$, where
$\lambda$ is the Lebesgue measure on $\mathbb{R}^d$. This procedure is a convex aggregation with weights
which can be seen in two different ways. Yang's point of view is to express these weights as
a function of the likelihood of the model, namely

\[
\tilde f_n(x) = \sum_{j=1}^{M} \tilde w_j^{(n)} f_j(x),
\tag{1}
\]

where the weights are $\tilde w_j^{(n)} = (n+1)^{-1} \sum_{k=0}^{n} w_j^{(k)}$ and

\[
w_j^{(k)} = \frac{f_j(X_1) \cdots f_j(X_k)}{\sum_{l=1}^{M} f_l(X_1) \cdots f_l(X_k)}, \quad \forall k = 1, \dots, n,
\quad \text{and} \quad w_j^{(0)} = \frac{1}{M}.
\tag{2}
\]



The second point of view is to write these weights as exponential ones, as used in
Augustin et al. (1997), Catoni (2004), Hartigan (2002), Bunea and Nobel (2005), Juditsky et al.
(2005) and Lecué (2005), for different statistical models. Define the empirical Kullback loss
$K_n(f) = -(1/n) \sum_{i=1}^{n} \log f(X_i)$ (keeping only the term independent of the underlying
density to estimate) for all densities $f$. We can rewrite these weights as exponential weights:

\[
w_j^{(k)} = \frac{\exp(-k K_k(f_j))}{\sum_{l=1}^{M} \exp(-k K_k(f_l))}, \quad \forall k = 0, \dots, n.
\]
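To make the two descriptions of the weights concrete, here is a minimal Python sketch (not taken from the paper) that computes $w_j^{(k)}$ and $\tilde w_j^{(n)}$ from the values $f_j(X_i)$. It works with cumulative log-likelihoods, which is the exponential-weights form, since $-kK_k(f_j) = \sum_{i \le k} \log f_j(X_i)$. The function name `yang_weights`, the array layout and the three candidate densities in the example are illustrative assumptions, and all values $f_j(X_i)$ are assumed to be strictly positive.

```python
import numpy as np


def yang_weights(dens_at_sample):
    """Weights of the aggregate (1)-(2); hypothetical helper, not from the paper.

    dens_at_sample[j, i] holds f_j(X_{i+1}); shape (M, n), entries assumed > 0.
    Returns (w, w_tilde) where w[k, j] = w_j^{(k)} for k = 0..n and
    w_tilde[j] = (n + 1)^{-1} * sum_k w_j^{(k)}.
    """
    M, n = dens_at_sample.shape
    log_f = np.log(dens_at_sample)
    # cum[j, k] = sum_{i <= k} log f_j(X_i) = -k * K_k(f_j), with cum[:, 0] = 0
    cum = np.concatenate([np.zeros((M, 1)), np.cumsum(log_f, axis=1)], axis=1)
    # Normalise over j for each k; subtracting the max avoids overflow
    cum = cum - cum.max(axis=0, keepdims=True)
    w = np.exp(cum)
    w = w / w.sum(axis=0, keepdims=True)
    w = w.T                      # shape (n + 1, M)
    return w, w.mean(axis=0)     # average over k = 0..n


# Illustrative example: three candidate densities on [0, 1]
rng = np.random.default_rng(0)
X = rng.beta(2.0, 1.0, size=200)            # data drawn from the density 2x
dens = np.stack([np.ones_like(X),           # uniform density
                 2.0 * X,                   # density 2x
                 2.0 * (1.0 - X)])          # density 2(1 - x)
w, w_tilde = yang_weights(dens)
print(w_tilde)                              # convex weights of the aggregate
```

The aggregate at a point $x$ is then the convex combination of the candidate values $f_j(x)$ with weights `w_tilde`. Averaging the weights over $k = 0, \dots, n$, rather than using only the last row of `w`, is what distinguishes the aggregate (1) from a plain posterior-type mixture.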



Most of the results on convergence properties of aggregation methods are obtained for
the regression and the Gaussian white noise models. Nevertheless, Catoni (1997, 2004),
Devroye and Lugosi (2001), Yang (2000), Zhang (2003) and Rigollet and Tsybakov (2004)
have explored the performance of aggregation procedures in the density estimation
framework. Most of them have established upper bounds for some procedure and do not deal
with the problem of optimality of their procedures. To our knowledge, lower bounds
for the performance of aggregation methods in density estimation are available only in
Rigollet and Tsybakov (2004). Their results are obtained with respect to the mean squared
risk. Catoni (1997) and Yang (2000) construct procedures and give convergence rates w.r.t.
the KL loss. One aim of this paper is to prove optimality of one of these procedures w.r.t.
the KL loss. Lower bounds w.r.t. the Hellinger distance and the $L_1$-distance (stated in
Section 3) and some results of Birgé (2004) and Devroye and Lugosi (2001) (recalled in Section
4) suggest that the rates of convergence obtained in Theorems 2 and 4 are optimal in the
sense given in Definition 1. In fact, an approximate bound can be achieved, if we allow the
leading term in the RHS of the oracle inequality (i.e. in the upper bound) to be multiplied
by a constant greater than one.

The paper is organized as follows. In Section 2 we give a definition of optimality, for a
rate of aggregation and for an aggregation procedure, and our main results. Lower bounds,
for different loss functions, are given in Section 3. In Section 4 we recall a result of Yang
(2000) about an exact oracle inequality satisfied by the aggregation procedure introduced
in (1).

2. Main definition and main results



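To evaluate the accuracy of a density estimator we use the Kullback-Leibler (KL) divergence,
the Hellinger distance and the $L_1$-distance as loss functions. The KL divergence is defined
for all densities $f$, $g$ w.r.t. a $\sigma$-finite measure $\nu$ on a space $\mathcal{X}$, by

\[
K(f|g) =
\begin{cases}
\int_{\mathcal{X}} \log\left(\frac{f}{g}\right) f \, d\nu & \text{if } P_f \ll P_g, \\
+\infty & \text{otherwise,}
\end{cases}
\]

where $P_f$ (respectively $P_g$) denotes the probability distribution of density $f$ (respectively
$g$) w.r.t. $\nu$. The Hellinger distance is defined for all non-negative measurable functions $f$ and
$g$ by

\[
H(f, g) = \left\| \sqrt{f} - \sqrt{g} \right\|_2,
\]

where the $L_2$-norm is defined by $\|f\|_2 = \left( \int_{\mathcal{X}} f^2(x) \, d\nu(x) \right)^{1/2}$ for all functions $f \in L_2(\mathcal{X}, \nu)$.
The $L_1$-distance is defined for all measurable functions $f$ and $g$ by

\[
v(f, g) = \int_{\mathcal{X}} |f - g| \, d\nu.
\]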
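As a small numerical illustration of these three loss functions, the following Python sketch approximates $K(f|g)$, $H(f,g)$ and $v(f,g)$ for densities on $[0,1]$ tabulated on a grid, using the trapezoidal rule. The helper names, the grid and the example densities are assumptions made for illustration only; the sketch also assumes that $g > 0$ wherever $f > 0$ on the grid, so the $+\infty$ branch of the KL divergence is not handled.

```python
import numpy as np


def integrate(values, grid):
    """Trapezoidal rule for values tabulated on an increasing grid."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))


def kl_divergence(f, g, grid):
    """Approximates K(f|g) = int log(f/g) f dnu (assumes g > 0 where f > 0)."""
    mask = f > 0
    integrand = np.zeros_like(f, dtype=float)
    integrand[mask] = f[mask] * np.log(f[mask] / g[mask])
    return integrate(integrand, grid)


def hellinger(f, g, grid):
    """Approximates H(f, g) = || sqrt(f) - sqrt(g) ||_2."""
    return float(np.sqrt(integrate((np.sqrt(f) - np.sqrt(g)) ** 2, grid)))


def l1_distance(f, g, grid):
    """Approximates v(f, g) = int |f - g| dnu."""
    return integrate(np.abs(f - g), grid)


# Illustrative example: f(x) = 2x against the uniform density on [0, 1]
grid = np.linspace(0.0, 1.0, 10001)
f = 2.0 * grid
g = np.ones_like(grid)
print(kl_divergence(f, g, grid), hellinger(f, g, grid) ** 2, l1_distance(f, g, grid))
```

For this pair the exact values are $K(f|g) = \log 2 - 1/2$, $H^2(f,g) = 2 - 4\sqrt{2}/3$ and $v(f,g) = 1/2$, which is consistent with the inequality $K(f|g) \ge H^2(f,g)$ used later in the paper.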



The main goal of this paper is to find the optimal rate of aggregation in the sense of the
definition given below. This definition is an analog, for the density estimation problem, of
the one in Tsybakov (2003) for the regression problem.

Definition 1 Take $M \ge 2$ an integer, $\mathcal{F}$ a set of densities on $(\mathcal{X}, \mathcal{A}, \nu)$ and $\mathcal{F}_0$ a set of
functions on $\mathcal{X}$ with values in $\mathbb{R}$ such that $\mathcal{F} \subseteq \mathcal{F}_0$. Let $d$ be a loss function on the set $\mathcal{F}_0$.
A sequence of positive numbers $(\psi_n(M))_{n \in \mathbb{N}^*}$ is called optimal rate of aggregation of
$M$ functions in $(\mathcal{F}_0, \mathcal{F})$ w.r.t. the loss $d$ if:

(i) There exists a constant $C < \infty$, depending only on $\mathcal{F}_0$, $\mathcal{F}$ and $d$, such that for all
functions $f_1, \dots, f_M$ in $\mathcal{F}_0$ there exists an estimator $\tilde f_n$ (aggregate) of $f$ such that

\[
\sup_{f \in \mathcal{F}} \left[ \mathbb{E}_f\left[ d(f, \tilde f_n) \right] - \min_{i=1,\dots,M} d(f, f_i) \right] \le C \psi_n(M), \quad \forall n \in \mathbb{N}^*.
\tag{3}
\]

(ii) There exist some functions $f_1, \dots, f_M$ in $\mathcal{F}_0$ and $c > 0$ a constant independent of $M$
such that for all estimators $\hat f_n$ of $f$,

\[
\sup_{f \in \mathcal{F}} \left[ \mathbb{E}_f\left[ d(f, \hat f_n) \right] - \min_{i=1,\dots,M} d(f, f_i) \right] \ge c \psi_n(M), \quad \forall n \in \mathbb{N}^*.
\tag{4}
\]

Moreover, when the inequalities (3) and (4) are satisfied, we say that the procedure $\tilde f_n$,
appearing in (3), is an optimal aggregation procedure w.r.t. the loss $d$.

Let $A > 1$ be a given number. In this paper we are interested in the estimation of densities
lying in

\[
\mathcal{F}(A) = \{\text{densities bounded by } A\}
\tag{5}
\]

and, depending on the loss function used, we aggregate functions in $\mathcal{F}_0$ which can be:

1. $\mathcal{F}_K(A)$ = {densities bounded by $A$} for the KL divergence,

2. $\mathcal{F}_H(A)$ = {non-negative measurable functions bounded by $A$} for the Hellinger distance,

3. $\mathcal{F}_v(A)$ = {measurable functions bounded by $A$} for the $L_1$-distance.

The main result of this paper, obtained by using Theorem 5 and assertion (6) of Theorem
3, is the following Theorem.

Theorem 1 Let $A > 1$. Let $M$ and $n$ be two integers such that $\log M \le 16 (\min(1, A - 1))^2 n$.
The sequence

\[
\psi_n(M) = \frac{\log M}{n}
\]

is an optimal rate of aggregation of $M$ functions in $(\mathcal{F}_K(A), \mathcal{F}(A))$ (introduced in (5))
w.r.t. the KL divergence loss. Moreover, the aggregation procedure with exponential weights,
defined in (1), achieves this rate. So, this procedure is an optimal aggregation procedure
w.r.t. the KL loss.

Moreover, observing Theorem 6 and the result of Devroye and Lugosi (2001) (recalled
at the end of Section 4), the rates obtained in Theorems 2 and 4,

\[
\left( \frac{\log M}{n} \right)^{q/2},
\]

are near optimal rates of aggregation for the Hellinger distance and the $L_1$-distance to the
power $q$, where $q > 0$, if we allow the leading term "$\min_{i=1,\dots,M} d(f, f_i)^q$" to be multiplied
by a constant greater than one, in the upper bound and the lower bound.

3. Lower bounds 

To prove lower bounds of type (4) we use the following lemma on minimax lower bounds,
which can be obtained by combining Theorems 2.2 and 2.5 in Tsybakov (2004). We say that
$d$ is a semi-distance on $\Theta$ if $d$ is symmetric, satisfies the triangle inequality and $d(\theta, \theta) = 0$.

Lemma 1 Let $d$ be a semi-distance on the set of all densities on $(\mathcal{X}, \mathcal{A}, \nu)$ and $w$ be a
non-decreasing function defined on $\mathbb{R}_+$ which is not identically $0$. Let $(\psi_n)_{n \in \mathbb{N}}$ be a sequence of
positive numbers. Let $\mathcal{C}$ be a finite set of densities on $(\mathcal{X}, \mathcal{A}, \nu)$ such that $\mathrm{card}(\mathcal{C}) = M \ge 2$,

\[
\forall f \ne g \in \mathcal{C}, \quad d(f, g) \ge 4 \psi_n > 0,
\]

and the KL divergences $K(P_f^{\otimes n} | P_g^{\otimes n})$, between the product probability measures
corresponding to densities $f$ and $g$ respectively, satisfy, for some $f_0 \in \mathcal{C}$,

\[
\forall f \in \mathcal{C}, \quad K(P_f^{\otimes n} | P_{f_0}^{\otimes n}) \le (1/16) \log(M).
\]

Then,

\[
\inf_{\hat f_n} \sup_{f \in \mathcal{C}} \mathbb{E}_f\left[ w\left( \psi_n^{-1} d(\hat f_n, f) \right) \right] \ge c_1,
\]

where $\inf_{\hat f_n}$ denotes the infimum over all estimators based on a sample of size $n$ from an
unknown distribution with density $f$ and $c_1 > 0$ is an absolute constant.

Now, we give a lower bound of the form (4) for the three different loss functions
introduced at the beginning of Section 2. Lower bounds are given in the problem of estimation
of a density on $\mathbb{R}^d$, namely we have $\mathcal{X} = \mathbb{R}^d$ and $\nu$ is the Lebesgue measure on $\mathbb{R}^d$.

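Theorem 2 Let $M \ge 2$ be an integer, $A > 1$ and $q > 0$ two numbers. We have
for all integers $n$ such that $\log M \le 16 (\min(1, A - 1))^2 n$,

\[
\sup_{f_1, \dots, f_M \in \mathcal{F}_H(A)} \inf_{\hat f_n} \sup_{f \in \mathcal{F}(A)}
\left[ \mathbb{E}_f\left[ H(\hat f_n, f)^q \right] - \min_{j=1,\dots,M} H(f_j, f)^q \right]
\ge c \left( \frac{\log M}{n} \right)^{q/2},
\]

where $c$ is a positive constant which depends only on $A$ and $q$. The sets $\mathcal{F}(A)$ and $\mathcal{F}_H(A)$
are defined in (5) when $\mathcal{X} = \mathbb{R}^d$ and the infimum is taken over all the estimators based on
a sample of size $n$.

Proof : For all densities $f_1, \dots, f_M$ bounded by $A$ we have,

\[
\sup_{f_1, \dots, f_M \in \mathcal{F}_H(A)} \inf_{\hat f_n} \sup_{f \in \mathcal{F}(A)}
\left[ \mathbb{E}_f\left[ H(\hat f_n, f)^q \right] - \min_{j=1,\dots,M} H(f_j, f)^q \right]
\ge \inf_{\hat f_n} \sup_{f \in \{f_1, \dots, f_M\}} \mathbb{E}_f\left[ H(\hat f_n, f)^q \right].
\]

Thus, to prove Theorem 2, it suffices to find $M$ appropriate densities bounded by $A$ and to
apply Lemma 1 with a suitable rate.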

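We consider $D$ the smallest integer such that $2^{D/8} \ge M$ and $\Delta = \{0, 1\}^D$. We set
$h_j(y) = h(y - (j-1)/D)$ for all $y \in \mathbb{R}$, where $h(y) = (L/D) g(Dy)$ and $g(y) = \mathbf{1}_{[0,1/2]}(y) -
\mathbf{1}_{(1/2,1]}(y)$ for all $y \in \mathbb{R}$, and $L > 0$ will be chosen later. We consider

\[
f_\delta(x) = \mathbf{1}_{[0,1]^d}(x) \left( 1 + \sum_{j=1}^{D} \delta_j h_j(x_1) \right), \quad \forall x = (x_1, \dots, x_d) \in \mathbb{R}^d,
\]

for all $\delta = (\delta_1, \dots, \delta_D) \in \Delta$. We take $L$ such that $L \le D \min(1, A - 1)$; thus, for all $\delta \in \Delta$,
$f_\delta$ is a density bounded by $A$. We choose our densities $f_1, \dots, f_M$ in $\mathcal{B} = \{f_\delta : \delta \in \Delta\}$,
but we do not take all of the densities of $\mathcal{B}$ (because they are too close to each other),
but only a subset of $\mathcal{B}$, indexed by a separated set (this is a set where all the points are
separated from each other by a given distance) of $\Delta$ for the Hamming distance defined
by $\rho(\delta^1, \delta^2) = \sum_{i=1}^{D} \mathbf{1}(\delta_i^1 \ne \delta_i^2)$ for all $\delta^1 = (\delta_1^1, \dots, \delta_D^1)$, $\delta^2 = (\delta_1^2, \dots, \delta_D^2) \in \Delta$. Since
$\int_{\mathbb{R}} h \, d\lambda = 0$, we have

\[
H^2(f_{\delta^1}, f_{\delta^2})
= \sum_{j=1}^{D} \mathbf{1}(\delta_j^1 \ne \delta_j^2) \int_{(j-1)/D}^{j/D} \left( 1 - \sqrt{1 + h_j(x)} \right)^2 dx
= 2 \rho(\delta^1, \delta^2) \int_0^{1/D} \left( 1 - \sqrt{1 + h(x)} \right) dx,
\]

for all $\delta^1, \delta^2 \in \Delta$. On the other hand the function $\psi(x) = 1 - \alpha x^2 - \sqrt{1 + x}$, where
$\alpha = 8^{-3/2}$, is convex on $[-1, 1]$ and we have $|h(x)| \le L/D \le 1$ so,
according to Jensen, $\int_0^1 \psi(h(x)) \, dx \ge \psi\left( \int_0^1 h(x) \, dx \right) = \psi(0) = 0$. Therefore
$\int_0^{1/D} \left( 1 - \sqrt{1 + h(x)} \right) dx \ge \alpha \int_0^{1/D} h^2(x) \, dx = (\alpha L^2)/D^3$, and we have

\[
H^2(f_{\delta^1}, f_{\delta^2}) \ge \frac{2 \alpha L^2}{D^3} \rho(\delta^1, \delta^2),
\]

for all $\delta^1, \delta^2 \in \Delta$. According to Varshamov-Gilbert, cf. Tsybakov (2004, p. 89) or
Ibragimov and Hasminskii (1980), there exists a $D/8$-separated set $N_{D/8}$ on $\Delta$ for
the Hamming distance such that its cardinality is at least $2^{D/8}$ and $(0, \dots, 0) \in N_{D/8}$. On
the separated set $N_{D/8}$ we have,

\[
\forall \delta^1 \ne \delta^2 \in N_{D/8}, \quad H^2(f_{\delta^1}, f_{\delta^2}) \ge \frac{\alpha L^2}{4 D^2}.
\]

In order to apply Lemma 1, we need to control the KL divergences too. Since we have
taken $N_{D/8}$ such that $(0, \dots, 0) \in N_{D/8}$, we can control the KL divergences w.r.t. $P_0$,
the Lebesgue measure on $[0, 1]^d$. We denote by $P_\delta$ the probability of density $f_\delta$ w.r.t. the
Lebesgue measure on $\mathbb{R}^d$, for all $\delta \in \Delta$. We have,

\[
\begin{aligned}
K(P_\delta^{\otimes n} | P_0^{\otimes n})
&= n \int_{[0,1]^d} \log(f_\delta(x)) f_\delta(x) \, dx \\
&= n \sum_{j=1}^{D} \int_{(j-1)/D}^{j/D} \log(1 + \delta_j h_j(x)) (1 + \delta_j h_j(x)) \, dx \\
&= n \left( \sum_{j=1}^{D} \delta_j \right) \int_0^{1/D} \log(1 + h(x)) (1 + h(x)) \, dx,
\end{aligned}
\]

for all $\delta = (\delta_1, \dots, \delta_D) \in N_{D/8}$. Since $\forall u > -1$, $\log(1 + u) \le u$, we have,

\[
K(P_\delta^{\otimes n} | P_0^{\otimes n})
\le n \left( \sum_{j=1}^{D} \delta_j \right) \int_0^{1/D} (1 + h(x)) h(x) \, dx
\le n D \int_0^{1/D} h^2(x) \, dx = \frac{n L^2}{D^2}.
\]

Since $\log M \le 16 (\min(1, A - 1))^2 n$, we can take $L$ such that $(n L^2)/D^2 = \log(M)/16$
while still having $L \le D \min(1, A - 1)$. Thus, for $L = (D/4) \sqrt{\log(M)/n}$, we have for all
distinct elements $\delta^1, \delta^2$ in $N_{D/8}$, $H^2(f_{\delta^1}, f_{\delta^2}) \ge (\alpha/64)(\log(M)/n)$ and, for all $\delta \in N_{D/8}$,
$K(P_\delta^{\otimes n} | P_0^{\otimes n}) \le (1/16) \log(M)$.

Applying Lemma 1 when $d$ is $H$, the Hellinger distance, with $M$ densities $f_1, \dots, f_M$
in $\{f_\delta : \delta \in N_{D/8}\}$ where $f_1 = \mathbf{1}_{[0,1]^d}$ and the increasing function $w(u) = u^q$, we get the
result.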
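The computations in this proof can be checked numerically. The Python sketch below is an illustration under stated assumptions, not part of the original argument: it builds $f_\delta$ in dimension $d = 1$, evaluates it at the midpoints of a uniform partition of $[0,1]$ (so that the piecewise-constant integrands are integrated exactly), and compares the squared Hellinger distance with its closed form and with the lower bound $2\alpha L^2 \rho(\delta^1,\delta^2)/D^3$. The function names and the values $D = 8$, $L = 4$ are hypothetical choices.

```python
import numpy as np

ALPHA = 8 ** -1.5  # the constant alpha = 8^{-3/2} from the proof


def g(y):
    """g = 1 on [0, 1/2], -1 on (1/2, 1], 0 elsewhere."""
    plus = np.where((y >= 0.0) & (y <= 0.5), 1.0, 0.0)
    minus = np.where((y > 0.5) & (y <= 1.0), 1.0, 0.0)
    return plus - minus


def f_delta(x, delta, L):
    """f_delta(x) = 1 + sum_j delta_j h_j(x) on [0, 1], with h(y) = (L/D) g(Dy)."""
    D = len(delta)
    out = np.ones_like(x, dtype=float)
    for j, d_j in enumerate(delta, start=1):
        out = out + d_j * (L / D) * g(D * (x - (j - 1) / D))
    return out


def hamming(d1, d2):
    """Hamming distance rho between two binary vectors."""
    return int(np.sum(np.asarray(d1) != np.asarray(d2)))


def hellinger_sq_closed_form(rho, D, L):
    """2 * rho * int_0^{1/D} (1 - sqrt(1 + h)) dx, for |h| = L/D < 1."""
    a = L / D
    return (rho / D) * (2.0 - np.sqrt(1.0 + a) - np.sqrt(1.0 - a))


# Illustrative values (assumptions): D = 8 cells, L = 4 so that L/D = 1/2
D, L = 8, 4.0
N = 2 * D * 50                       # 50 midpoints per half-cell
x = (np.arange(N) + 0.5) / N         # midpoints never hit a discontinuity
d1 = np.array([1, 0, 1, 0, 0, 0, 1, 1])
d2 = np.array([0, 0, 1, 1, 0, 1, 1, 0])
f1, f2 = f_delta(x, d1, L), f_delta(x, d2, L)

rho = hamming(d1, d2)
hell_sq = np.mean((np.sqrt(f1) - np.sqrt(f2)) ** 2)   # midpoint rule on [0, 1]
lower_bound = 2 * ALPHA * L ** 2 * rho / D ** 3
kl_to_uniform = np.mean(f1 * np.log(f1))              # K(P_delta | P_0), n = 1
print(hell_sq, hellinger_sq_closed_form(rho, D, L), lower_bound, kl_to_uniform)
```

With these values the numerical squared Hellinger distance agrees with the closed form, stays above the lower bound, and the single-observation KL divergence stays below $L^2/D^2$, as the proof requires.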



Remark 1 The construction of the family of densities $\{f_\delta : \delta \in N_{D/8}\}$ is in the same spirit
as the lower bounds of Tsybakov (2003) and Rigollet and Tsybakov (2004) but, as compared to
Rigollet and Tsybakov (2004), we consider a different problem (model selection aggregation)
and, as compared to Tsybakov (2003), we study a different model (density estimation). Also,
our risk function is different from those considered in these papers.

Now, we give a lower bound for the KL divergence. We have the same residual as for the square
of the Hellinger distance.

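Theorem 3 Let $M \ge 2$ be an integer, $A > 1$ and $q > 0$. We have, for any integer $n$ such
that $\log M \le 16 (\min(1, A - 1))^2 n$,

\[
\sup_{f_1, \dots, f_M \in \mathcal{F}_K(A)} \inf_{\hat f_n} \sup_{f \in \mathcal{F}(A)}
\left[ \mathbb{E}_f\left[ (K(f | \hat f_n))^q \right] - \min_{j=1,\dots,M} (K(f | f_j))^q \right]
\ge c \left( \frac{\log M}{n} \right)^{q},
\tag{6}
\]

and

\[
\sup_{f_1, \dots, f_M \in \mathcal{F}_K(A)} \inf_{\hat f_n} \sup_{f \in \mathcal{F}(A)}
\left[ \mathbb{E}_f\left[ (K(\hat f_n | f))^q \right] - \min_{j=1,\dots,M} (K(f_j | f))^q \right]
\ge c \left( \frac{\log M}{n} \right)^{q},
\tag{7}
\]

where $c$ is a positive constant which depends only on $A$. The sets $\mathcal{F}(A)$ and $\mathcal{F}_K(A)$ are
defined in (5) for $\mathcal{X} = \mathbb{R}^d$.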

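Proof : The proof of the inequality (7) of Theorem 3 is similar to the one for (6). Since we
have for all densities $f$ and $g$,

\[
K(f | g) \ge H^2(f, g),
\]

(a proof is given in Tsybakov (2004, p. 73)), it suffices to note that, if $f_1, \dots, f_M$ are densities
bounded by $A$ then,

\[
\begin{aligned}
\sup_{f_1, \dots, f_M \in \mathcal{F}_K(A)} \inf_{\hat f_n} \sup_{f \in \mathcal{F}(A)}
\left[ \mathbb{E}_f\left[ (K(f | \hat f_n))^q \right] - \min_{j=1,\dots,M} (K(f | f_j))^q \right]
&\ge \inf_{\hat f_n} \sup_{f \in \{f_1, \dots, f_M\}} \mathbb{E}_f\left[ (K(f | \hat f_n))^q \right] \\
&\ge \inf_{\hat f_n} \sup_{f \in \{f_1, \dots, f_M\}} \mathbb{E}_f\left[ H^{2q}(f, \hat f_n) \right],
\end{aligned}
\]

to get the result by applying Theorem 2.

With the same method as Theorem 2, we get the result below for the $L_1$-distance.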



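Theorem 4 Let $M \ge 2$ be an integer, $A > 1$ and $q > 0$. We have for any integer $n$ such
that $\log M \le 16 (\min(1, A - 1))^2 n$,

\[
\sup_{f_1, \dots, f_M \in \mathcal{F}_v(A)} \inf_{\hat f_n} \sup_{f \in \mathcal{F}(A)}
\left[ \mathbb{E}_f\left[ v(f, \hat f_n)^q \right] - \min_{j=1,\dots,M} v(f, f_j)^q \right]
\ge c \left( \frac{\log M}{n} \right)^{q/2},
\]

where $c$ is a positive constant which depends only on $A$. The sets $\mathcal{F}(A)$ and $\mathcal{F}_v(A)$ are
defined in (5) for $\mathcal{X} = \mathbb{R}^d$.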

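Proof : The only difference with Theorem 2 is in the control of the distances. With the
same notations as in the proof of Theorem 2, we have,

\[
v(f_{\delta^1}, f_{\delta^2}) = \int_{[0,1]^d} |f_{\delta^1}(x) - f_{\delta^2}(x)| \, dx
= \rho(\delta^1, \delta^2) \int_0^{1/D} |h(x)| \, dx = \frac{L}{D^2} \rho(\delta^1, \delta^2),
\]

for all $\delta^1, \delta^2 \in \Delta$. Thus, for $L = (D/4) \sqrt{\log(M)/n}$ and $N_{D/8}$, the $D/8$-separated set of $\Delta$
introduced in the proof of Theorem 2, we have,

\[
v(f_{\delta^1}, f_{\delta^2}) \ge \frac{1}{32} \sqrt{\frac{\log(M)}{n}}, \quad \forall \delta^1 \ne \delta^2 \in N_{D/8},
\quad \text{and} \quad K(P_\delta^{\otimes n} | P_0^{\otimes n}) \le \frac{1}{16} \log(M), \quad \forall \delta \in \Delta.
\]

Therefore, by applying Lemma 1 to the $L_1$-distance with $M$ densities $f_1, \dots, f_M$ in $\{f_\delta : \delta \in N_{D/8}\}$
where $f_1 = \mathbf{1}_{[0,1]^d}$ and the increasing function $w(u) = u^q$, we get the result.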



4. Upper bounds 

In this section we use an argument in Yang (2000) (see also Catoni (2004)) to show that the
rate of the lower bound of Theorem 3 is an optimal rate of aggregation with respect to the
KL loss. We use an aggregate constructed by Yang (defined in (1)) to attain this rate. An
upper bound of the type (3) is stated in the following Theorem. Note that Theorem 5
holds in a general framework of a measurable space $(\mathcal{X}, \mathcal{A})$ endowed with a $\sigma$-finite measure
$\nu$.

Theorem 5 (Yang) Let $X_1, \dots, X_n$ be $n$ observations of a probability measure on $(\mathcal{X}, \mathcal{A})$
of density $f$ with respect to $\nu$. Let $f_1, \dots, f_M$ be $M$ densities on $(\mathcal{X}, \mathcal{A}, \nu)$. The aggregate
$\tilde f_n$, introduced in (1), satisfies, for any underlying density $f$,

\[
\mathbb{E}_f\left[ K(f | \tilde f_n) \right] \le \min_{j=1,\dots,M} K(f | f_j) + \frac{\log(M)}{n + 1}.
\tag{8}
\]



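Proof : The proof follows the lines of Yang (2000); although he does not state the result in
the form (3), for convenience we reproduce the argument here. We define $\hat f_k(x; X^{(k)}) =
\sum_{j=1}^{M} w_j^{(k)} f_j(x)$, $\forall k = 1, \dots, n$ (where $w_j^{(k)}$ is defined in (2) and $x^{(k)} = (x_1, \dots, x_k)$ for all
$k \in \mathbb{N}$ and $x_1, \dots, x_k \in \mathcal{X}$) and $\hat f_0(x; X^{(0)}) = (1/M) \sum_{j=1}^{M} f_j(x)$ for all $x \in \mathcal{X}$. Thus, we
have

\[
\tilde f_n(x; X^{(n)}) = \frac{1}{n + 1} \sum_{k=0}^{n} \hat f_k(x; X^{(k)}).
\]

Let $f$ be a density on $(\mathcal{X}, \mathcal{A}, \nu)$. We have

\[
\begin{aligned}
\sum_{k=0}^{n} \mathbb{E}_f\left[ K(f | \hat f_k) \right]
&= \sum_{k=0}^{n} \int_{\mathcal{X}^{k+1}} \log\left( \frac{f(x_{k+1})}{\hat f_k(x_{k+1}; x^{(k)})} \right)
\prod_{i=1}^{k+1} f(x_i) \, d\nu^{\otimes(k+1)}(x_1, \dots, x_{k+1}) \\
&= \int_{\mathcal{X}^{n+1}} \log\left( \frac{f(x_1) \cdots f(x_{n+1})}{\prod_{k=0}^{n} \hat f_k(x_{k+1}; x^{(k)})} \right)
\prod_{i=1}^{n+1} f(x_i) \, d\nu^{\otimes(n+1)}(x_1, \dots, x_{n+1}),
\end{aligned}
\]

but $\prod_{k=0}^{n} \hat f_k(x_{k+1}; x^{(k)}) = (1/M) \sum_{j=1}^{M} f_j(x_1) \cdots f_j(x_{n+1})$, $\forall x_1, \dots, x_{n+1} \in \mathcal{X}$, thus,

\[
\sum_{k=0}^{n} \mathbb{E}_f\left[ K(f | \hat f_k) \right]
= \int_{\mathcal{X}^{n+1}} \log\left( \frac{f(x_1) \cdots f(x_{n+1})}{\frac{1}{M} \sum_{j=1}^{M} f_j(x_1) \cdots f_j(x_{n+1})} \right)
\prod_{i=1}^{n+1} f(x_i) \, d\nu^{\otimes(n+1)}(x_1, \dots, x_{n+1}),
\]

moreover $x \mapsto \log(1/x)$ is a decreasing function so,

\[
\begin{aligned}
\sum_{k=0}^{n} \mathbb{E}_f\left[ K(f | \hat f_k) \right]
&\le \min_{j=1,\dots,M} \left\{ \int_{\mathcal{X}^{n+1}} \log\left( \frac{f(x_1) \cdots f(x_{n+1})}{\frac{1}{M} f_j(x_1) \cdots f_j(x_{n+1})} \right)
\prod_{i=1}^{n+1} f(x_i) \, d\nu^{\otimes(n+1)}(x_1, \dots, x_{n+1}) \right\} \\
&\le \log M + \min_{j=1,\dots,M} \left\{ \int_{\mathcal{X}^{n+1}} \log\left( \frac{f(x_1) \cdots f(x_{n+1})}{f_j(x_1) \cdots f_j(x_{n+1})} \right)
\prod_{i=1}^{n+1} f(x_i) \, d\nu^{\otimes(n+1)}(x_1, \dots, x_{n+1}) \right\},
\end{aligned}
\]

finally we have,

\[
\sum_{k=0}^{n} \mathbb{E}_f\left[ K(f | \hat f_k) \right] \le \log M + (n + 1) \inf_{j=1,\dots,M} K(f | f_j).
\tag{9}
\]

On the other hand we have,

\[
\mathbb{E}_f\left[ K(f | \tilde f_n) \right]
= \int_{\mathcal{X}^{n+1}} \log\left( \frac{f(x_{n+1})}{\frac{1}{n+1} \sum_{k=0}^{n} \hat f_k(x_{n+1}; x^{(k)})} \right)
\prod_{i=1}^{n+1} f(x_i) \, d\nu^{\otimes(n+1)}(x_1, \dots, x_{n+1}),
\]

and $x \mapsto \log(1/x)$ is convex, thus,

\[
\mathbb{E}_f\left[ K(f | \tilde f_n) \right] \le \frac{1}{n + 1} \sum_{k=0}^{n} \mathbb{E}_f\left[ K(f | \hat f_k) \right].
\tag{10}
\]

Theorem 5 follows by combining (9) and (10).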
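The key step of this proof is the telescoping identity for the product of the one-step predictive densities, which holds pathwise for any positive values $f_j(x_i)$. The following Python sketch is an illustrative check under assumptions of our own (random positive values stand in for $f_j(x_i)$, and the function name is hypothetical): it verifies the identity and the resulting pathwise bound with the $\log M$ term.

```python
import numpy as np


def predictive_densities(dens):
    """One-step predictive values of the mixtures used in the proof.

    dens[j, i] stands for f_j(x_{i+1}) along a path x_1, ..., x_{n+1};
    shape (M, n + 1), entries assumed > 0.
    Returns p with p[k] = hat f_k(x_{k+1}; x^(k)) for k = 0..n.
    """
    M, T = dens.shape
    # cum[j, k] = prod_{i <= k} f_j(x_i), with the empty product cum[:, 0] = 1
    cum = np.concatenate([np.ones((M, 1)), np.cumprod(dens, axis=1)], axis=1)
    # w[j, k] = w_j^{(k)} for k = 0..n (k = 0 gives the uniform weights 1/M)
    w = cum[:, :T] / cum[:, :T].sum(axis=0, keepdims=True)
    return (w * dens).sum(axis=0)


# Illustrative check with arbitrary positive values in place of f_j(x_i)
rng = np.random.default_rng(0)
M, T = 5, 12
dens = rng.uniform(0.5, 1.5, size=(M, T))
p = predictive_densities(dens)

# Telescoping identity: prod_k hat f_k = (1/M) sum_j prod_i f_j(x_i)
lhs = np.prod(p)
rhs = np.mean(np.prod(dens, axis=1))
# Pathwise version of (9): cumulative log-loss exceeds the best f_j by at most log M
regret = -np.sum(np.log(p)) - np.min(-np.sum(np.log(dens), axis=1))
print(lhs, rhs, regret, np.log(M))
```

Taking expectations of the pathwise bound under $f$ gives inequality (9); the averaging over $k$ in (10) then converts this cumulative bound into the bound (8) on the aggregate.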



Birgé constructs estimators, called T-estimators (the "T" is for "test"), which are
adaptive in model selection aggregation of $M$ estimators with a residual proportional to
$(\log M/n)^{q/2}$ when the Hellinger and $L_1$-distances are used to evaluate the quality of
estimation (cf. Birgé (2004)). But this does not give an optimal result as Yang's does, because there
is a constant greater than 1 in front of the main term $\min_{i=1,\dots,M} d^q(f, f_i)$, where $d$ is the
Hellinger distance or the $L_1$-distance. Nevertheless, observing the proofs of Theorems 2 and
4, we can obtain

\[
\sup_{f_1, \dots, f_M \in \mathcal{F}(A)} \inf_{\hat f_n} \sup_{f \in \mathcal{F}(A)}
\left[ \mathbb{E}_f\left[ d(f, \hat f_n)^q \right] - C(q) \min_{i=1,\dots,M} d(f, f_i)^q \right]
\ge c \left( \frac{\log M}{n} \right)^{q/2},
\]

where $d$ is the Hellinger or $L_1$-distance, $q > 0$ and $A > 1$. The constant $C(q)$ can be chosen
equal to the one appearing in the following Theorem. The same residual appears in this
lower bound and in the upper bounds of Theorem 6, so we can say that

\[
\left( \frac{\log M}{n} \right)^{q/2}
\]

is a near optimal rate of aggregation w.r.t. the Hellinger distance or the $L_1$-distance to the
power $q$. We recall Birgé's results in the following Theorem.

Theorem 6 (Birgé) If we have $n$ observations of a probability measure of density $f$ w.r.t.
$\nu$ and $f_1, \dots, f_M$ densities on $(\mathcal{X}, \mathcal{A}, \nu)$, then there exists an estimator $\tilde f_n$ (T-estimator)
such that for any underlying density $f$ and $q > 0$, we have

\[
\mathbb{E}_f\left[ H(f, \tilde f_n)^q \right] \le C(q) \left( \min_{j=1,\dots,M} H(f, f_j)^q + \left( \frac{\log M}{n} \right)^{q/2} \right),
\]

and for the $L_1$-distance we can construct an estimator $\tilde f_n$ which satisfies:

\[
\mathbb{E}_f\left[ v(f, \tilde f_n)^q \right] \le C(q) \left( \min_{j=1,\dots,M} v(f, f_j)^q + \left( \frac{\log M}{n} \right)^{q/2} \right),
\]

where $C(q) > 0$ is a constant depending only on $q$.



Another result, which can be found in Devroye and Lugosi (2001), states that the
minimum distance estimate proposed by Yatracos (1985) (cf. Devroye and Lugosi (2001,
p. 59)) achieves the same aggregation rate as in Theorem 6 for the $L_1$-distance with $q = 1$.
Namely, for all $f, f_1, \dots, f_M \in \mathcal{F}(A)$,

\[
\mathbb{E}_f\left[ v(f, \breve f_n) \right] \le 3 \min_{j=1,\dots,M} v(f, f_j) + \sqrt{\frac{\log M}{n}},
\]

where $\breve f_n$ is the estimator of Yatracos defined by

\[
\breve f_n = \arg\min_{f \in \{f_1, \dots, f_M\}} \sup_{A \in \mathcal{A}}
\left| \int_A f - \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \in A\}} \right|
\]

and $\mathcal{A} = \{\{x : f_i(x) > f_j(x)\} : 1 \le i, j \le M\}$.
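The selection rule of Yatracos can be sketched in a few lines of Python for candidate densities on $[0,1]$. This is a simplified illustration, not the construction of Devroye and Lugosi: the integrals $\int_A f$ are approximated by a midpoint rule on a grid, and the function name, the grid and the three candidates are assumptions made for the example.

```python
import numpy as np


def yatracos_select(candidates, sample, grid):
    """Minimum distance selection among candidate densities on [0, 1].

    candidates: list of vectorised density functions (hypothetical inputs)
    sample:     observations X_1, ..., X_n in [0, 1]
    grid:       midpoints of a uniform partition of [0, 1]
    Returns (index of the selected candidate, array of sup-deviations).
    """
    M = len(candidates)
    on_grid = np.array([f(grid) for f in candidates])      # shape (M, N)
    on_sample = np.array([f(sample) for f in candidates])  # shape (M, n)
    scores = np.zeros(M)
    for i in range(M):
        for j in range(M):
            if i == j:
                continue  # the set {f_i > f_i} is empty
            in_A = on_grid[i] > on_grid[j]                 # A = {x : f_i(x) > f_j(x)}
            empirical = np.mean(on_sample[i] > on_sample[j])
            for m in range(M):
                mass = np.mean(on_grid[m] * in_A)          # midpoint rule for int_A f_m
                scores[m] = max(scores[m], abs(mass - empirical))
    return int(np.argmin(scores)), scores


# Illustrative example: data drawn from the density 2x on [0, 1]
candidates = [lambda x: np.ones_like(x),    # uniform density
              lambda x: 2.0 * x,            # density 2x
              lambda x: 2.0 * (1.0 - x)]    # density 2(1 - x)
rng = np.random.default_rng(0)
sample = rng.beta(2.0, 1.0, size=500)
grid = (np.arange(1000) + 0.5) / 1000
best, scores = yatracos_select(candidates, sample, grid)
print(best, scores)
```

The rule only compares each candidate with the empirical measure on the at most $M(M-1)$ sets where one candidate dominates another, which is why the residual term depends on $M$ only through $\log M$.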